ABSTRACT 

Cube-DB is a database of pre-evaluated results for 
detection of functional divergence in human/verte- 
brate protein families. The analysis is organized 
around the nomenclature associated with the 
human proteins, but based on all currently available 
vertebrate genomes. Using full genomes enables us, 
through a mutual-best-hit strategy, to construct 
comparable taxonomical samples for all paralogues 
under consideration. Functional specialization is 
scored on the residue level according to two 
models of behavior after divergence: heterotachy 
and homotachy. In the first case, the positions on 
the protein sequence are scored highly if they are 
conserved in the reference group of orthologs, and 
overlap poorly with the residue type choice in 
the paralog groups (such positions will also be 
termed functional determinants). The second 
model additionally requires conservation within 
each group of paralogs (functional discriminants). 
The scoring functions are phylogeny independent, 
but sensitive to the residue type similarity. The 
results are presented as a table of per-residue 
scores, and mapped onto a related structure (when 
available) via a browser-embedded visualization tool. 
They can also be downloaded as a spreadsheet 
table, and as sessions for two additional molecular 
visualization tools. The database interface is available 
at http://epsf.bmad.bii.a-star.edu.sg/cube/db/html/home.html. 

INTRODUCTION 

Cube-DB is designed to answer the question: which 
residues in a protein, belonging to a family of human 
paralogs, are responsible for its functional specialization? 
Intuitively we expect that such residues should have the 
same type (that is, be conserved) in related species. 
Whether they should be conserved as different types 
across paralogs in the same species has been the subject 
of some debate (1-5). Cube-DB takes the position that, 
once the functional shift has occurred, conservation 
in the paralogs is no longer expected, and reports residues 
that are well conserved in the protein of interest and 
different in its paralogs, irrespective of their degree of 
conservation there. 

A similar view was taken in FunShift (6). The authors 
of that 2005 compilation used a maximum likelihood 
method (7) to establish the rate of mutation across 
branches of a presumptive evolutionary tree. In contrast 
to this phylogeny-based approach, Cube-DB uses a 
tree-independent heuristic, to be discussed in the 
'Methods' section, to estimate both within-ortholog 
group conservation, and the overlap (or lack thereof) 
across different paralogs. 

The SDR database (8), in contrast, adheres to the view 
that the positions of functional importance should be 
conserved in all paralogs, an assumption that has repeat- 
edly been shown to work well for the catalytic sites of 
enzymes (9,10). While Cube-DB displays this type of in- 
formation side by side with the overall and group-specific 
conservation, it emphasizes the last characteristic — within 
group conservation — as a feature of practical import- 
ance in other (non-enzymatic) cases of functional diver- 
gence (11). 

While several servers (that is, web applications that 
generate the analysis on the fly) offer specialization 
analysis on the set of sequences provided by the user 
(12-14), we choose to simplify the process by providing 
sets of sequences that are known to be paralogous and to 
align well. To do so, we limit our attention to the sets 
of sequences for which this information is relatively 
straightforward to establish, yet which are of preeminent 
interest for biomedical applications: human families of 
paralogous proteins and their conservation across vertebrate 
orthologs. 

Staying within the nomenclature associated with the 
human versions of proteins enables us also to design a 
straightforward and intuitive interface for browsing the 
database contents. Furthermore, the database offers a 
unique take-home way of presenting the results, in terms 
of downloadable spreadsheets and sessions for two 
popular molecular visualization tools. 



DATA PROVENANCE AND DATABASE SCOPE 

We organize our analysis around the nomenclature/division 
into families provided by the HUGO Gene 
Nomenclature Committee (15), but the results are 
equally valid for (and indeed based on) all vertebrate 
genomes currently available in Ensembl (16). By its 
design and purpose, the database is oriented toward com- 
parison of vertebrate paralogs. Working with full genomes 
enables us, by using a mutual-best-hit [a.k.a. BeT, the best 
hit (17) or bidirectional best hit (18)] strategy, to construct 
relatively complete and reliable sets of orthologs from all 
available species, and obtain balanced sets of sequences 
for all paralogs under consideration. By balanced here 
we mean 'covering a comparable taxonomical breadth.' 
The question of problematic alignments is sidestepped 
by limiting the analysis to clusters of paralogs with at 
least 40% sequence similarity (19). In its current edition, 
Cube-DB thus presents the results for 226 named groups 
of paralogs, divided into 600 clusters of alignable 
sequences. 



METHOD 

Assembling and aligning the relevant sequence set 

Cube-DB subdivides the list of human protein families 
provided by HUGO (15) into clusters of proteins with at 
least 40% sequence identity in at least 70% of their 
alignable (non-gap) length. The purpose is 2-fold: it elim- 
inates the problem of ambiguous alignments (19), and it 
helps divide the results into tractable chunks for presenta- 
tion. Different choices (and sizes) of groups to be 
compared are, of course, possible, and should at some 
point be available through the accompanying server. 
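
The clustering criterion above can be sketched as follows. This is a minimal illustration and not the Cube-DB implementation: the function names and the '-' gap character are assumptions, as is the reading of 'at least 70% of their alignable length' as the number of gap-free column pairs divided by the length of the shorter ungapped sequence.

```python
def pairwise_identity_and_coverage(a, b, gap="-"):
    """Return (identity, coverage) for two rows of the same alignment.

    identity: fraction of identical residues among columns where neither
              row has a gap.
    coverage: number of such gap-free columns divided by the length of the
              shorter ungapped sequence (an assumed reading of
              "alignable length").
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    paired = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    shorter = min(len(a) - a.count(gap), len(b) - b.count(gap))
    if not paired or shorter == 0:
        return 0.0, 0.0
    identity = sum(x == y for x, y in paired) / len(paired)
    coverage = len(paired) / shorter
    return identity, coverage


def same_cluster(a, b, min_identity=0.40, min_coverage=0.70):
    """Apply the 40% identity / 70% coverage thresholds quoted in the text."""
    identity, coverage = pairwise_identity_and_coverage(a, b)
    return identity >= min_identity and coverage >= min_coverage
```

How the pairwise decisions are merged into clusters (for example by single linkage) is not specified in the text and is left out of the sketch.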

For human sequences belonging to a cluster, the 
orthologs from other vertebrate species from Ensembl 
(16) are retrieved by a mutual-best-hit strategy. For a 
recent comparison of the mutual-best-hit approach with 
other available options, see (20). The taxonomical 
content of the database is thus entirely determined by 
the vertebrate genomes currently (Release 64) deposited 
in Ensembl. 
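
As an illustration of the mutual-best-hit idea, the sketch below pairs a human protein with a protein from another genome only if each is the other's top-scoring hit. It assumes the similarity scores have already been computed by some search tool; the data layout (nested dictionaries of scores) and all names are hypothetical and not taken from the Cube-DB pipeline.

```python
def best_hit(scores):
    """Return the key with the highest score, or None for an empty mapping."""
    return max(scores, key=scores.get) if scores else None


def mutual_best_hits(forward, backward):
    """Pair each query with its target only when the hit is reciprocal.

    forward[q]  maps target-genome proteins to their score against human query q.
    backward[t] maps human proteins to their score against target protein t.
    """
    pairs = {}
    for query, hits in forward.items():
        target = best_hit(hits)
        if target is not None and best_hit(backward.get(target, {})) == query:
            pairs[query] = target
    return pairs
```

A query whose best target prefers a different human protein is simply left without an ortholog in that genome, which is what keeps paralogs from being mistaken for orthologs.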

When an ortholog is reported to be missing in the 
database of known (annotated) proteins from a genome, 
it is sought in the ab initio detected set of proteins. 

Each set of orthologs is aligned using Mafft (21) and the 
resulting alignments (corresponding to a single human 
paralog each) are then profile-aligned using the same 
program. 



Assigning relative conservation and specialization scores to 
each position in the alignment 

The algebraic expressions used to evaluate the scores 
described below can be found in the Supplementary 
Data. An extensive discussion can be found in (11). 

For each position in the overall alignment the conser- 
vation is scored on the [0, 1] scale. Similarly, for each pair 
of paralogous groups, the overlap in the amino acid type 
choice is turned into a quantity in the same range. In 
addition, the scoring functions are sensitive to similarity 
between the amino acid types — both conservation and 
overlap are measured from their expected values given 
the distribution of the amino acid types at a position, 
the overall variability in the alignment, and the average 
propensity of residue types to mutate into each other [see 
Supplementary Data and (11)]. 

These elementary scores are linearly combined into two 
different kinds of specialization scores, rewarding: (i) dis- 
criminants — positions that are conserved in each paralog 
as a different amino acid type, and (ii) determinants — pos- 
itions that, referring to a particular paralog, are conserved 
in that group, and different in the non-reference groups, 
irrespective of their conservation therein. 
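
To make the distinction between the two specialization scores concrete, here is a deliberately simplified toy version. It is not the scoring used in Cube-DB (the actual expressions are in the Supplementary Data and in (11)): it ignores residue-type similarity and expected values, uses normalized Shannon entropy as a stand-in for conservation, the shared frequency mass as a stand-in for overlap, and equal weights in the linear combination. All function names are hypothetical, each group is assumed to be a gap-free string of residues at one alignment position, and at least two groups are required.

```python
import math
from collections import Counter
from itertools import combinations


def frequencies(column):
    """Relative frequency of each residue type in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def conservation(column):
    """1 minus Shannon entropy normalized by log(20); 1.0 is fully conserved."""
    h = -sum(p * math.log(p) for p in frequencies(column).values())
    return 1.0 - h / math.log(20)


def overlap(col_a, col_b):
    """Shared frequency mass of two groups; 0 is disjoint, 1 is identical."""
    fa, fb = frequencies(col_a), frequencies(col_b)
    return sum(min(p, fb.get(aa, 0.0)) for aa, p in fa.items())


def determinant_score(groups, reference):
    """High when the reference group is conserved and differs from the others."""
    others = [g for g in groups if g != reference]
    mean_overlap = sum(overlap(groups[reference], groups[g]) for g in others) / len(others)
    return 0.5 * conservation(groups[reference]) + 0.5 * (1.0 - mean_overlap)


def discriminant_score(groups):
    """High when every group is conserved, each as a different residue type."""
    mean_cons = sum(conservation(c) for c in groups.values()) / len(groups)
    pairs = list(combinations(groups, 2))
    mean_overlap = sum(overlap(groups[a], groups[b]) for a, b in pairs) / len(pairs)
    return 0.5 * mean_cons + 0.5 * (1.0 - mean_overlap)
```

With a position that reads K in every ortholog of one paralog and a mixture of D and E in the other, the first paralog is a perfect determinant in this toy scheme, whereas the discriminant score is lowered by the variability within the second group.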

For comparison, the overall conservation across all se- 
quences is calculated using a previously published method 
(16). Highly conserved positions do not coincide with the 
specific positions, and correspond to structural and func- 
tional features common to all paralogs in the cluster. 

Database organization 

Starting from the list of family names, which is small from 
a computational perspective, and clustering the paralogs 
by similarity results in a shallow directory structure that 
can be traversed quickly without database management 
software. All the result files corresponding to a 
cluster specified in the query are located directly in that 
cluster's directory, and returned to the user. 
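
A lookup in such a shallow directory tree needs nothing more than path construction. The sketch below assumes a hypothetical layout of the form root/family/cluster/ with the result files stored directly inside; the actual layout and file names used by Cube-DB are not described in the text.

```python
from pathlib import Path


def cluster_files(root, family, cluster):
    """Return the sorted result files for one cluster, or [] if it is absent.

    Assumes a hypothetical layout <root>/<family>/<cluster>/<result files>.
    """
    cluster_dir = Path(root) / family / cluster
    if not cluster_dir.is_dir():
        return []
    return sorted(p for p in cluster_dir.iterdir() if p.is_file())
```

For example, `cluster_files("/data/cube", "IFNR", "cluster_2")` would list whatever files are stored for that cluster, where the root path is made up for illustration.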

RESULTS AND THEIR PRESENTATION 

Input 

The database can be browsed through the alphabetically 
ordered HUGO nomenclature, or searched by protein 
name, using a simple string matching search. 
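
A simple string-matching search of this kind might look as follows. The case-insensitive substring match and the index structure (protein name mapped to a family and cluster) are assumptions made for illustration; the text does not say how the search is implemented.

```python
def search_by_name(query, index):
    """Return index entries whose protein name contains the query string.

    index maps a protein name to a (family, cluster) pair; matching is a
    case-insensitive substring test.
    """
    needle = query.strip().lower()
    if not needle:
        return {}
    return {name: loc for name, loc in index.items() if needle in name.lower()}
```
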

Output 

Per-residue estimates of the degree of conservation 
across families, as well as scores for two different models 
of functional divergence [discriminants and determinants; also 
termed type I and type II (1), heterotachy and homotachy 
(2) in the molecular evolution literature], are presented as 
a scrollable HTML table and, when a related structure is 
available, in an embedded Jmol (22) visualization tool. 
These results are also available for download as an xls 
spreadsheet table, and as sessions for two different protein 
visualization tools, Pymol (23) and Chimera (24) (Figure 1). 
To keep the size of visualization sessions manageable, the 
visualization for determinants of each paralogous group is 
presented as an individual session. The alignments used in 
the analysis, as well as the complete work directory, are also 
available for download to interested users. Help pages are 
accessible from each page of the web interface. 

Figure 1. Result presentation in Cube-DB. (A) Several columns from the alignment of IFNAR2 and IFNGR1, members of the interferon receptor 
family. (B) The region from the downloadable spreadsheet corresponding to the same region as shown in (A), collating the information about 
conservation (white-black-red colorbar) with the information about specialization (blue-orange colorbar). The 'conservation' group of columns shows 
conservation across all groups, as well as within each group of orthologs in the cluster. The 'specificity' group of columns shows the scoring of 
discriminant behavior, which is a property of the cluster as a whole (see 'Methods' section), and determinant behavior, using each group as a 
reference in turn. (C) Visualization of conservation in IFNAR2, using the downloadable Chimera (24) session. (D) Visualization of specificity 
determinants in IFNAR2, using the downloadable Pymol (23) session. The information about the same group of residues is encircled in the table and on the 
visualization frame, to illustrate the correspondence of the color coding between the two. 

User's perspective: an example. We illustrate the database 
functionality with the example of a hypothetical user 
investigating the sources of functional difference between 
interferon-α receptor 2 (IFNAR2) and its cousin, 
interferon-γ receptor 1 (IFNGR1). We chose this example 
because the thorough mutational study undertaken by 
Piehler and Schreiber (25) enables us to examine the 
results presented by the database in retrospect. 

The related analysis can be found by locating the IFNR 
(Interferon receptors) family on the Browse page, followed 
by narrowing the search down to 'cluster_2', a subgroup 
of the family consisting of two members: IFNAR2 and 
IFNGR1. The same page can be located through the 
Search window, using 'IFNAR2' or 'IFNGR1' as search 
terms. 

Since for our protein of interest the structure is 
available, the top of the results page shows side-by-side 
visualization of overall conservation and discriminant 
specialization scores, mapped onto the structure. 
Discriminant behavior here refers to positions that are 
conserved within a group of orthologs, but as a different 
residue type within each group. To keep the visual cue as 
clear as possible, we use two different coloring 
schemes for the two properties (conservation and 
specialization). The colorbars shown in Figure 1 also appear 
at the top of the results page. The color scales we 
chose, mainly because they have very little overlap, are 
white-black-red for conservation and blue-orange for 
specialization. This 3D mapping of the results draws 
attention to potential clustering of residues on one face of 
the protein that might otherwise go unnoticed in the 
sequence. 

The same information, using the same color-coding, can 
be found in the table below the visualization windows, 
where the numerical values of the scores are also given. 
Additionally, the table columns contain the information 
about the within-group conservation and the specialization 
scoring according to the determinant model. This last type 
of scoring is reference group-dependent (see 'Methods' 
section), and therefore one 'determinant' column for 
each of the groups appears in the table. 

However, this kind of result presentation is limited by 
the browser's capabilities, and is, furthermore, inflexible 
and unmodifiable. A typical user will already have 
sizable knowledge (public or proprietary) about the 
protein under investigation, which can extend, and be 
compared with, the conservation/specialization analysis. 
For that purpose the Downloads page for each cluster 
contains a number of downloadable files for use in a 
spreadsheet application or molecular visualization tools. 
The spreadsheet allows for further modification of the 
results' presentation: the residues can be sorted according 
to any of the provided scores, as needed by the researcher. 
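
The same kind of sorting can be scripted. The sketch below assumes the spreadsheet has been exported to CSV and that higher scores are more favorable; the column names in the usage note are hypothetical and do not reflect the actual layout of the downloadable file.

```python
import csv
import io


def top_residues(csv_text, score_column, n=5):
    """Return the n rows with the highest value in score_column.

    csv_text is the content of a CSV export with a header row; the score
    column must contain numbers.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return rows[:n]
```

Calling `top_residues(text, "determinant_IFNAR2", n=10)` on such an export would return the ten positions scored most strongly as determinants for that (hypothetically named) column.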

These files are further divided according to the underly- 
ing selection of sequences, which can cover all available 
vertebrates, or mammalian sequences only. They can 
be extended with additional information, and saved to 
be used as a reference. 

Figure 2 shows the correspondence between the mutational 
data of Piehler and Schreiber (25) and the specialization 
scoring using mammalian sequences and the 
determinant model of specialization (see 'Methods' 
section above), applied to IFNAR2. The upper panel of 
the figure shows as spheres the positions [see (11) and its 
supplementary material] found to be involved in the 
function specific for IFNAR2 (binding of interferon α2, 
IFNα2, and interferon β, IFNβ), which are mostly scored 
favorably (orange) by the scoring method. The lower 
panel shows the positions without functional impact on 
the binding between IFNAR2 and IFNα/β. They are 
scored unfavorably in the same scoring scheme. 

In a real-life scenario, the degree of involvement of 
individual residues in functional specialization is 
unknown, and is indeed the object of the study. 
Focusing on residues of strong specialization (orange) 
should help locate candidate regions of group-specific 
functional impact. Such residues will typically be found 
interspersed with conserved residues in the ordered 
pieces of secondary protein structure, or forming larger 
continuous stretches in disordered regions. 



Figure 2. Comparison of determinant scoring for IFNAR2 and the 
results of Piehler and Schreiber (25). The results are mapped on the 
structure of IFNAR2 determined by Nudelman et al. (26) (PDB identifier 
2lag). The positions tested in the experiment are shown as spheres. 
The coloring scheme corresponds to determinant model scoring, as 
implemented in Cube-DB, and applied to mammalian orthologues of 
the two proteins. (A) The positions shown in experiment [(25), Table 
2] to be involved in a function specific to IFNAR2 (binding of 
interferon α2, IFNα2, and interferon β, IFNβ). (B) The positions shown in 
experiment not to be involved in a function specific to IFNAR2. 

CONCLUSION AND OUTLOOK 

Cube-DB offers a unique service for detecting residues 
responsible for functional specialization in human 
protein families. Its largest value lies in collating several 
scores (for conservation within and across several groups 
of paralogs, as well as divergence between them) and 
presenting them in a form that can be downloaded and 
extended with further annotation by the user. Beyond 
doubt, the result presentation can be elaborated and 
improved on in several ways, for example by dynamically 
linking the (so far physically independent) visualizations 
for the alignments, the scores, and their mapping 
onto the structure. We hope to return to this possibility 
in one of the subsequent versions of the database. Also, 
the limitation to the protein families currently recognized 
by the community consensus will be amended in the future 
by case-specific extensions of the database, and through 
the (planned) accompanying server. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary References [1-3]. 